Optimizing Energy Efficiency in Modern
Computing Systems through Advanced
Architectural Techniques
by Yu-Yang Ting and Wei Chen
Introducon
In the contemporary era of rapid technological advancement, opmizing energy
eciency in modern compung systems has emerged as a crical challenge and an area
of intensive research. The push towards more sustainable and energy-ecient
compung pracces is driven by the exponenal growth in data processing needs, the
proliferaon of cloud compung, and the ubiquitous deployment of internet-connected
devices. The quest for minimizing energy consumpon while maximizing performance
has led to the exploraon of advanced architectural techniques that promise signicant
improvements in energy eciency. This literature review aims to dissect the myriad of
strategies and innovaons in computer architecture that contribute to energy
conservaon, ranging from novel processing units and memory management techniques
to predicve models and energy-aware algorithms. By synthesizing ndings from cung-
edge research, including studies on Non-Uniform Memory Access (NUMA) opmizaons,
Simultaneous Multhreading (SMT) enhancements, dynamic voltage and frequency
scaling (DVFS), and the implementaon of machine learning algorithms for energy
predicon, this review provides a comprehensive overview of the state-of-the-art
approaches to energy-ecient compung. Through the lens of recent advancements
and empirical studies, we aim to highlight the eecveness of these architectural
innovaons, idenfy prevailing challenges, and propose future direcons for research in
opmizing energy eciency in modern compung systems.
Methodology
Our methodology for this literature review on optimizing energy efficiency in modern computing systems encompasses a systematic search across academic databases, applying inclusion and exclusion criteria to select relevant studies, extracting and assessing data from these studies, synthesizing findings to identify trends and gaps, and analyzing the impact of architectural techniques on energy efficiency. This approach ensures a comprehensive and critical examination of current research in the field.
Results
In this section, we delve into the multifaceted strategies and innovations that have been explored to enhance energy efficiency in modern computing systems through advanced architectural techniques. Our discussion encompasses a range of approaches, from hardware optimizations such as Non-Uniform Memory Access (NUMA) adjustments and Simultaneous Multithreading (SMT) enhancements, to software interventions like dynamic voltage and frequency scaling (DVFS) and predictive modeling. Each strategy is examined for its potential to reduce energy consumption while maintaining or improving performance, illustrating the complexity and necessity of cross-disciplinary efforts in achieving sustainable computing practices.
NUMA opmizaon
The arcle "Energy-ecient I/O Thread Schedulers for NVMe SSDs on NUMA" [1]
presents a thorough invesgaon into the performance and energy eciency of using a
large number of parallel I/O threads. This study highlights that when managing I/O
operaons on Non-Uniform Memory Access (NUMA) architectures with NVMe Solid
State Drives (SSDs), the impact of CPU contenon—where mulple processes compete
for CPU me—is signicantly less detrimental than the impact of remote access. Remote
access refers to the process of accessing data from SSDs that are not directly connected
to the CPU aempng the access, which is a common scenario in NUMA systems. These
ndings suggest that opmizing for reduced remote access to NVMe SSDs could lead to
more substanal performance and energy eciency improvements than merely focusing
on minimizing CPU contenon in systems employing numerous parallel I/O threads.
So, they present a new algorithm on NVMe SSDs on NUMA - ENERGY-EFFICIENT I/O
SCHEDULER(ESN)
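The qualitative finding can be illustrated with a toy cost model. The penalty constants below are invented for illustration (the paper reports empirical measurements, not these numbers); the sketch only demonstrates the relationship the study describes: eliminating remote access can help more than reducing contention.

```python
# Toy cost model of I/O on a two-node NUMA system (illustrative numbers,
# not from [1]): each access pays a base cost, a contention cost that
# grows with the number of threads beyond the local CPUs, and a remote
# penalty for the fraction of accesses that cross nodes.

def avg_io_cost(n_threads, remote_fraction,
                base_us=10.0, remote_penalty_us=8.0,
                cpus_per_node=8, contention_us_per_thread=0.02):
    """Assumed average cost (microseconds) per I/O operation."""
    contention = contention_us_per_thread * max(0, n_threads - cpus_per_node)
    return base_us + contention + remote_fraction * remote_penalty_us

# Under these assumed constants, keeping all 256 threads but making every
# access local beats halving the thread count while staying half-remote:
local_only = avg_io_cost(256, remote_fraction=0.0)
fewer_threads_remote = avg_io_cost(128, remote_fraction=0.5)
print(local_only, fewer_threads_remote)
```

The specific numbers are arbitrary; the shape of the model mirrors the paper's conclusion that remote-access reduction is the more profitable optimization target.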
Background: NUMA
In class, we learned about Symmetric Multiprocessing (SMP). SMP involves a system where all CPUs share the same memory resources via the same memory bus, ensuring equal access speeds for all processors, hence the term "symmetric." However, as the number of CPUs increases, memory access conflicts also rise, leading to a rapid decline in CPU efficiency and performance. Non-Uniform Memory Access (NUMA), on the other hand, divides CPUs into multiple nodes, with each node possessing its own independent memory space and communicating with other nodes over a high-speed interconnect. In NUMA architectures, the speed at which a CPU accesses memory varies by node: accessing local node memory is fastest, while accessing remote node memory is slower, with speed decreasing as distance increases. This design inherently suffers from slower access when a node's memory is insufficient and data must be retrieved from a remote node.
Figure 1 SMP vs NUMA
NVMe SSDs on NUMA
Figure 2 NVMe SSDs on NUMA
In the nuanced landscape of NUMA (Non-Uniform Memory Access) architectures, the placement of NVMe (Non-Volatile Memory Express) Solid-State Drives (SSDs) becomes pivotal to system performance. Referencing the figures, we explore three settings. The first shows a single SSD coupled to one CPU node (Figure 3), potentially decreasing latency and memory contention through direct local access. The second involves a dual-SSD setup, each SSD connected to its own CPU node (Figure 5), harnessing parallelism and diminishing memory traffic across nodes, which is beneficial for high-throughput demands. The third depicts two SSDs connected to a single CPU node (Figure 4), which might enhance storage bandwidth at the cost of increased contention for memory access. As the figures suggest, remote memory access, a fundamental aspect of NUMA systems, often entails increased latency and access disputes. Consequently, the strategic arrangement of SSDs relative to CPU nodes is a critical consideration, with the goal of reducing the detriments of remote access, thereby underlining the importance of tailored SSD settings for the efficiency and speed of the overall system.
Figure 3 Single SSD NUMA setting [1]
Figure 4 Dual SSDs NUMA setting 1 [1]
Figure 5 Dual SSDs NUMA setting 2 [1]
Expanding upon the insights presented in Figure 6, it becomes apparent that system throughput declines once the number of I/O threads surpasses 512 in Setting 1 of Configuration 1. This setting involves a solitary SSD bound to a single CPU node. The degradation in performance at this point can be attributed primarily to a heightened contention penalty, which, in this scenario, imposes a more substantial bottleneck than the penalty incurred through remote access. The contention penalty arises as multiple I/O threads vie for access to the limited resources of the single SSD, resulting in an increased number of conflicts and, consequently, latency. This outcome underlines the intricate balance between resource allocation and scalability within NUMA architectures. It emphasizes the need for meticulous planning in the architectural design phase to accommodate the anticipated load and access patterns, thereby ensuring that the system is not only well-tuned for current demands but also scalable for future increases in load without significant performance trade-offs.
Figure 6 Single SSD on NUMA benchmark
Turning our attention to the dual-SSD NUMA configurations across different settings, a distinct divergence in performance becomes evident. As illustrated in the data, Setting 1 achieves a superior maximum throughput of approximately 1600 IOPS, whereas Setting 2 peaks at merely 1200 IOPS. The critical factor contributing to this discrepancy is the configuration of Setting 2, where both SSDs are affiliated with a single CPU node. Although this node has the advantage of localized access to the data stored within the SSDs, it is simultaneously hampered by the dual penalties of contention and remote access. Contention arises as the two SSDs on the same node compete for bandwidth, while other nodes incur a performance hit when attempting to access these SSDs remotely, leading to a compounded delay. In contrast, Setting 1, by distributing the SSDs across two CPU nodes, alleviates such contention and minimizes remote-access requirements, thus facilitating a higher throughput. This comparison underscores the necessity of considering both the placement of storage resources and the intended access patterns to optimize NUMA system performance comprehensively.
Figure 7 Setting 2 benchmark [1]
Figure 8 Setting 1 benchmark [1]
Linux default scheduler: CFS
In the referenced article [1], the "default scheduler" denotes the Linux kernel's default process scheduler, the Completely Fair Scheduler (CFS). CFS allocates time slices for process execution in a manner that aims to provide equitable CPU time to running processes, thereby ensuring a balanced distribution of computing resources. The article examines the performance of I/O operations within systems employing a single NVMe SSD and notes that prior assessments of I/O performance have predominantly been conducted with CFS operational. CFS plays a pivotal role in managing how tasks are prioritized and executed, and its influence extends to I/O performance through its placement of the threads that access the storage subsystem. In the context of NUMA systems, where multiple processors are involved, CFS must also contend with the additional complexity introduced by remote accesses, where data must be fetched across different memory nodes. The scheduler thus faces the dual challenge of managing equitable CPU access (its primary function) while also navigating the penalties associated with remote storage access. These penalties can exacerbate latency issues, especially when numerous I/O threads concurrently request access to storage resources that are not locally available.
Therefore, the scheduler's behavior can significantly impact system performance, particularly in NUMA configurations. Understanding and potentially optimizing the default scheduler's strategy for handling I/O-bound threads is crucial for enhancing overall system efficiency, particularly in settings with high I/O demands and in systems that are sensitive to latency variations caused by remote-access penalties.
Energy-ecient I/O scheduler (ESN)
Figure 9 ESN algorithm
The ESN (Energy-ecient Scheduler for NUMA systems) algorithm presents a set of
enhancements tailored for the unique demands of NUMA architectures. It focuses on
distribung loads equitably across nodes to avert bolenecks, thereby opmizing
resource ulizaon. By priorizing local memory access, ESN reduces the latency
typically associated with remote accesses, leading to speedier data retrieval and
diminishing energy expenditure. The algorithm also curtails resource contenon, which
is essenal for maintaining consistent performance levels, especially under the strain of
mulple concurrent processes.
Furthermore, ESN showcases impressive scalability. It adapts to the demands of
escalang parallel processes without the necessity for manual intervenon, a crical
feature for dynamic high-performance compung environments. In terms of energy
conservaon, ESN strategically aligns I/O threads with CPU sockets, taking into account
both the number of processes and the locaon of memory accesses. This alignment
ensures opmal system performance while concurrently reducing power consumpon
by perming the idling of CPUs not engaged in acve tasks. The core objecve of ESN is
thus to intelligently map an increased number of I/O threads to the same CPU, while
judiciously balancing performance consideraons. The culminaon of these mapped
threads allows for the potenal downscaling of energy usage by placing unused CPUs in
a low-power state, thereby enhancing the overall energy eciency of the system.
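This mapping idea can be sketched as a simple packing heuristic. The sketch is our illustrative reading of ESN's stated objective, not the paper's exact algorithm; the socket layout and the per-CPU thread cap are assumed values. Threads are packed onto the socket local to the target SSD first, leaving whole CPUs free to enter a low-power state.

```python
# Sketch of thread-to-CPU consolidation in the spirit of ESN: fill the
# SSD-local socket first, pack each CPU up to a cap, and leave the rest
# of the machine idle so those CPUs can be put in a low-power state.

def assign_threads(n_threads, sockets, threads_per_cpu, ssd_socket):
    """sockets: {socket_id: [cpu_id, ...]}. Returns {cpu_id: n_assigned}."""
    # Visit the SSD-local socket first to keep I/O accesses local.
    order = [ssd_socket] + [s for s in sorted(sockets) if s != ssd_socket]
    assignment = {}
    remaining = n_threads
    for sock in order:
        for cpu in sockets[sock]:
            if remaining <= 0:
                break
            n = min(threads_per_cpu, remaining)
            assignment[cpu] = n
            remaining -= n
    return assignment

sockets = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
plan = assign_threads(6, sockets, threads_per_cpu=4, ssd_socket=1)
idle_cpus = [c for s in sockets.values() for c in s if c not in plan]
```

With six threads and the SSD on socket 1, the heuristic uses only two CPUs on the local socket, leaving the other six CPUs idle and eligible for power-down, which is the energy lever the paper describes.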
Improvement
Figures 10 through 13 provide a comparative performance analysis between the ESN scheduler and the traditional CFS scheduler for systems utilizing a single NVMe SSD. These figures show that the ESN scheduler not only preserves a level of throughput similar to that of CFS but also excels in reducing latency. Moreover, ESN demonstrates a notable reduction in energy consumption, an advantage achieved by its ability to idle CPUs when they are not in active use. This capability of ESN to maintain operational efficiency while enhancing power management underscores its suitability for energy-conscious computing environments.
Figure 10 CFS vs ESN [1]
Figure 11 CFS vs ESN [1]
Figure 12 CFS vs ESN [1]
Figure 13 CFS vs ESN [1]
Discussion & Conclusion
References
[1] H. J. W. S.-a. a. S. S. J. Qian, "Energy-efficient I/O Thread Schedulers for NVMe SSDs on NUMA," in 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2017.